Show code
# import libraries
import numpy as np
import pandas as pd
import polars as pl
import duckdb as db
import os
from dotenv import load_dotenv
load_dotenv('.env')
import plotly.express as px
from great_tables import GT, mdManipulating and Visualizing User Patterns with Polars
Jesus L. Monroy
May, 2025
This project explores a dataset of Spotify music listening habits, leveraging the power of the Polars DataFrame library for efficient and scalable analysis. By employing Polars’ optimized data manipulation capabilities, we efficiently process and analyze Spotify data, extracting meaningful insights into user behavior and the dynamics of the music ecosystem. Furthermore, this work highlights the potential of Polars as a powerful tool for large-scale data analysis in the music industry and beyond.
This study conducts a historical analysis of Spotify data, leveraging the efficiency of Polars, the analytical power of DuckDB, and the visualization capabilities of Plotly. By combining these tools, we examine trends in music consumption, artist popularity, and playlist dynamics over time.
Polars facilitates rapid data manipulation and cleaning of large Spotify datasets, while DuckDB enables complex SQL queries for in-depth analysis.
Plotly is then employed to create interactive visualizations, revealing patterns and insights that would be difficult to discern from raw data alone.
This analysis explores the evolution of musical genres, the impact of algorithmic recommendations, and the shifting landscape of artist distribution.
We demonstrate the efficacy of this toolchain for extracting meaningful narratives from Spotify’s extensive historical data, providing a comprehensive understanding of the platform’s influence on the music industry.
Duckdb is a powerful tool for data analysts and developers who need to perform fast and efficient analytical queries on large datasets, especially in environments where simplicity and portability are crucial.Polars is a modern DataFrame library designed for speed and efficiency, offering a compelling alternative to traditional data manipulation tools.Schema([('spotify_track_uri', String),
('ts', Datetime(time_unit='us', time_zone=None)),
('platform', String),
('ms_played', Int64),
('track_name', String),
('artist_name', String),
('album_name', String),
('reason_start', String),
('reason_end', String),
('shuffle', Boolean),
('skipped', Boolean)])
(
GT(df.head())
.tab_header(title=md("### Spotify Songs History Dataset"))
.fmt_date(columns='ts', date_style='iso')
.fmt_number(columns='ms_played', decimals=0)
.cols_hide(columns='year')
.cols_label(
ts='Date',
platform='Platform',
ms_played='Mins played',
track_name='Song',
artist_name='Artist',
album_name='Album',
reason_start='Reason start',
reason_end='Reason end',
shuffle='Shuffle',
skipped='Skipped',
)
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: https://mavenanalytics.io')
)Spotify Songs History Dataset |
|||||||||
|---|---|---|---|---|---|---|---|---|---|
| Date | Platform | Mins played | Song | Artist | Album | Reason start | Reason end | Shuffle | Skipped |
| 2013-07-08 | web player | 3,185 | Say It, Just Say It | The Mowgli's | Waiting For The Dawn | autoplay | clickrow | false | false |
| 2013-07-08 | web player | 61,865 | Drinking from the Bottle (feat. Tinie Tempah) | Calvin Harris | 18 Months | clickrow | clickrow | false | false |
| 2013-07-08 | web player | 285,386 | Born To Die | Lana Del Rey | Born To Die - The Paradise Edition | clickrow | unknown | false | false |
| 2013-07-08 | web player | 134,022 | Off To The Races | Lana Del Rey | Born To Die - The Paradise Edition | trackdone | clickrow | false | false |
| 2013-07-08 | web player | 0 | Half Mast | Empire Of The Sun | Walking On A Dream | clickrow | nextbtn | false | false |
| Source: https://mavenanalytics.io | |||||||||
The data set has 149,860 rows.
The dataset goes from 2013 up to 2024.
For further information about the fields of the dataset see Table 7
(
GT(tops)
.tab_header(title=md("### Unique Artists, Albums and Songs Played by Year"))
.fmt_number(columns=['artist_name','album_name','track_name'], decimals=0)
.cols_label(
year='Year',
artist_name='Artists',
album_name='Albums',
track_name='Songs',
)
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)Unique Artists, Albums and Songs Played by Year |
|||
|---|---|---|---|
| Year | Artists | Albums | Songs |
| 2013 | 63 | 95 | 149 |
| 2014 | 21 | 22 | 23 |
| 2015 | 610 | 924 | 1,357 |
| 2016 | 458 | 940 | 2,535 |
| 2017 | 647 | 1,357 | 3,326 |
| 2018 | 439 | 947 | 2,917 |
| 2019 | 493 | 1,029 | 2,934 |
| 2020 | 820 | 1,569 | 3,929 |
| 2021 | 1,578 | 2,674 | 5,126 |
| 2022 | 1,220 | 2,284 | 4,477 |
| 2023 | 1,456 | 2,336 | 4,042 |
| 2024 | 1,072 | 1,828 | 3,587 |
| Source: Self-elaboration by author | |||
fig = px.line(years, x='year', y='total_mins',)
fig.update_layout(
title=dict(text='Spotify Played Minutes by Year', font=dict(size=30), yref='paper'),
template='ggplot2',
autosize=False,
width=830,
height=500,
showlegend=True,
plot_bgcolor='#f8f8f8',
yaxis=dict(title='Minutes'),
xaxis=dict(title=''),
margin=dict(
l=20,
r=20,
b=20,
t=50,
pad=4
)
)
fig.update_traces(line_color='#007f00')
fig.show()fig = px.line(months, x='ts', y='total_mins',)
fig.update_layout(
title=dict(text='Spotify Played Minutes by Month', font=dict(size=30), yref='paper'),
template='ggplot2',
autosize=False,
width=830,
height=500,
showlegend=True,
plot_bgcolor='#f8f8f8',
yaxis=dict(title='Minutes'),
xaxis=dict(title=''),
margin=dict(
l=20,
r=20,
b=20,
t=50,
pad=4
)
)
fig.update_traces(line_color='#007f00')
fig.show()| Year | Minutes Played |
|---|---|
| 2013 | 470 |
| 2014 | 62 |
| 2015 | 3,555 |
| 2016 | 11,857 |
| 2017 | 40,239 |
| 2018 | 28,454 |
| 2019 | 28,750 |
| 2020 | 55,239 |
| 2021 | 53,506 |
| 2022 | 38,495 |
| 2023 | 30,904 |
| 2024 | 28,962 |
| Source: Self-elaboration | |
(
GT(top_artists)
.tab_header(title=md("### Top 05 Artist by Year"))
.cols_width(
cases={
'year': '10%',
'artist_name': '100%',
}
)
.cols_label(
year='Year',
artist_name='Artist'
)
.cols_align(align='left', columns=('artist_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)Top 05 Artist by Year |
|
|---|---|
| Year | Artist |
| 2013 | ['John Mayer', 'The Kooks', 'Lana Del Rey', 'Coldplay', 'Passion Pit'] |
| 2014 | ['The Neighbourhood', 'Blur', 'Freddie King', 'Discovery', 'Mystery Jets'] |
| 2015 | ['The Script', 'The Rolling Stones', 'David Bisbal', 'Justin Bieber', 'Frank Sinatra'] |
| 2016 | ['The Beatles', 'Led Zeppelin', 'Elvis Presley', 'Johnny Cash', 'Vampire Weekend'] |
| 2017 | ['The Beatles', 'John Mayer', 'The Killers', 'Bob Dylan', 'Radiohead'] |
| 2018 | ['The Beatles', 'Paul McCartney', 'Bob Dylan', 'John Mayer', 'The Killers'] |
| 2019 | ['The Beatles', 'Paul McCartney', 'The Killers', 'Bob Dylan', 'John Mayer'] |
| 2020 | ['The Killers', 'The Beatles', 'John Mayer', 'The Strokes', 'Bob Dylan'] |
| 2021 | ['The Beatles', 'The Killers', 'John Mayer', 'Bob Dylan', 'Kings of Leon'] |
| 2022 | ['The Beatles', 'Joaquín Sabina', 'The Killers', 'John Mayer', 'Howard Shore'] |
| 2023 | ['The Beatles', 'The Killers', 'Bob Dylan', 'John Mayer', 'Howard Shore'] |
| 2024 | ['The Beatles', 'John Mayer', 'The Killers', 'ABBA', 'Paul McCartney'] |
| Source: Self-elaboration by author | |
(
GT(top_albums)
.tab_header(title=md("### Top 05 Albums by Year"))
.cols_width(
cases={
'year': '10%',
'album_name': '100%',
}
)
.cols_label(
year='Year',
album_name='Album'
)
.cols_align(align='left', columns=('album_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)Top 05 Albums by Year |
|
|---|---|
| Year | Album |
| 2013 | ['Inside In / Inside Out', 'Born To Die - The Paradise Edition', 'Gossamer', 'Battle Studies', 'Paper Doll'] |
| 2014 | ['Blur: The Best Of', 'Lily of the Valley', 'Bloc Party EP', 'Can’t Get Better Than This', 'The Essential Ted Nugent'] |
| 2015 | ['Tú Y Yo', 'No Sound Without Silence', 'Ultimate Sinatra', 'Somewhere Somehow', 'Save Rock And Roll'] |
| 2016 | ['Elvis At Sun', 'At Folsom Prison', 'New York', 'Ultimate Sinatra', 'The Beatles'] |
| 2017 | ['The Beatles', 'Past Masters', 'Abbey Road', 'The Wall', 'Help!'] |
| 2018 | ['The Beatles', 'Egypt Station', 'Past Masters', 'Beatles For Sale - Remastered', 'The Wall'] |
| 2019 | ['Abbey Road', 'The Beatles', 'Past Masters', 'Prismism', 'Egypt Station'] |
| 2020 | ['Imploding The Mirage', 'The New Abnormal', 'The Beatles', 'The Wall', 'Hot Fuss'] |
| 2021 | ['Pressure Machine', 'When You See Yourself', 'The Beatles', 'Abbey Road', 'Past Masters'] |
| 2022 | ['Revolver', 'The Wall', 'Pressure Machine', 'Past Masters', 'The Lord of the Rings: The Fellowship of the Ring - the Complete Recordings'] |
| 2023 | ['The Beatles', 'Abbey Road', 'Past Masters', "Sam's Town", 'Hot Fuss'] |
| 2024 | ['Arrival', 'Past Masters', 'The Beatles', 'Born and Raised', 'Paradise Valley'] |
| Source: Self-elaboration by author | |
(
GT(top_songs)
.tab_header(title=md("### Top 05 Songs by Year"))
.cols_width(
cases={
'year': '10%',
'track_name': '100%',
}
)
.cols_label(
year='Year',
track_name='Song'
)
.cols_align(align='left', columns=('track_name'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)Top 05 Songs by Year |
|
|---|---|
| Year | Song |
| 2013 | ['Paper Doll', 'Mirrors', 'Born To Die', 'Young And Beautiful', 'In Your Atmosphere - Live at the Nokia Theatre, Los Angeles, CA - December 2007'] |
| 2014 | ['Wires', 'Feel It', 'Parklife', 'Wang Dang Sweet Poontang', "I Can't Win"] |
| 2015 | ['Paraíso', 'What Do You Mean?', 'Diez Mil Maneras', 'Superheroes', 'Switzerland'] |
| 2016 | ['Teenagers', 'The River', 'Not Today', 'South Bound Saurez - Remaster', "I've Got You Under My Skin"] |
| 2017 | ['Married with Children - 2014 Remaster', 'In the Blood', 'Perfect', 'The Man', 'Gravity'] |
| 2018 | ['Dominoes', 'No Words - 2010 Remaster', 'Universal Gleam', 'Band On The Run - 2010 Remaster', "I Don't Know"] |
| 2019 | ['Reminder', 'Once', "Maybe It's Time", 'Band On The Run - 2010 Remaster', 'Mariposa Traicionera'] |
| 2020 | ['Caution', 'Ode To The Mets', 'Dying Breed', 'Imploding The Mirage', 'My Own Soul’s Warning'] |
| 2021 | ['Crucify Your Mind', '100,000 People', 'Not in Nottingham', 'When You See Yourself, Are You Far Away', 'Did My Best'] |
| 2022 | ['19 Dias y 500 Noches - En Directo', 'Postdata', 'Por el Bulevar de los Sueños Rotos', 'Se Me Olvidó Otra Vez - Unplugged; 2020 Remasterizado', 'Telefonía'] |
| 2023 | ['You Sexy Thing', 'Primera Cita', 'Qué Bonito Es Querer', 'Always Alright', 'Billy 4'] |
| 2024 | ['Why Did It Have To Be Me?', 'Superdeli', 'Does Your Mother Know', 'Dancing Queen', 'Knowing Me, Knowing You'] |
| Source: Self-elaboration by author | |
(
GT(platform)
.tab_header(title=md("### Percentage of Use by Platform"))
.fmt_number(columns='len', decimals=0)
.fmt_percent('percent', decimals=2)
.cols_label(
platform='Platform',
len='Total',
percent='Percentage'
)
.cols_align(align='left', columns=('platform'))
.tab_options(table_font_size='90%')
.tab_options(table_width='90%')
.tab_source_note(source_note='Source: Self-elaboration by author')
)Percentage of Use by Platform |
||
|---|---|---|
| Platform | Total | Percentage |
| android | 139,821 | 93.30% |
| cast to device | 3,898 | 2.60% |
| iOS | 3,049 | 2.03% |
| windows | 1,691 | 1.13% |
| mac | 1,176 | 0.78% |
| web player | 225 | 0.15% |
| Source: Self-elaboration by author | ||
In this analysis, we have shown the following metrics:
Data dictionary
| Field | Description |
|---|---|
| spotify_track_uri | Spotify URI that uniquely identifies each track |
| ts | Timestamp indicating when the track stopped playing in UTC |
| platform | Platform used when streaming the track |
| ms_played | Number of milliseconds the stream was played |
| track_name | Name of the track |
| artist_name | Name of the artist |
| album_name | Name of the album |
| reason_start | Why the track started |
| reason_end | Why the track ended |
| shuffle | If shuffle mode was used when playing the track (boolean) |
| skipped | If the user skipped to the next song (boolean) |
Jesus L. Monroy
Economist & Data Scientist